The recent global pandemic has forced people to rely on the internet more than ever, for activities ranging from online shopping to online education. Toxic comments are detrimental to internet users and limit their freedom to express diverse perspectives; such unconstructive remarks are discourteous, disrespectful, and detestable. This animosity leads users to disable comments in many online communities and, eventually, to stop expressing their opinions altogether. Hence, there is a pressing need for a fast, precise, and deployable solution that helps organizations weed out such remarks before they get online.
At present, a plethora of state-of-the-art pre-trained models such as BERT are available for performing classification. Although these models are very accurate, they are huge and require extensive computational resources for training. Their massive size makes them less useful for deployment in constrained environments such as mobile devices. Through this project, we aim to learn how to perform multi-label text classification and build a naïve solution that is fast, lightweight (i.e., suitable for deployment in constrained environments), and highly accurate in classifying comments.
The dataset is obtained from Kaggle as part of the [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) hosted by Jigsaw/Conversation AI in 2017.
The dataset is drawn from a Wikipedia corpus that was rated by human raters for toxicity. The corpus contains 63M comments from discussions relating to user pages and articles, dating from 2004 to 2015.
Different platforms/sites can have different standards for their toxic screening process. Hence the comments are tagged in the following six categories:
The tagging was done via crowdsourcing, meaning the dataset was rated by many different people; this labeling noise is a probable cause of lower accuracy.
Our notebook is divided into the following sections:
import numpy as np
import pandas as pd
# visualizations
from plotly import tools, subplots
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
import plotly.io as pio
pio.templates.default = "presentation"
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
# ordered Dictionary
from collections import OrderedDict
# word cloud
from wordcloud import WordCloud, STOPWORDS
# regex
import re
# nltk
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain
# score
from sklearn.metrics import hamming_loss
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
import pickle
import tensorflow as tf
from tensorflow.random import set_seed
set_seed(18)
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, Dense, Activation,Bidirectional, LSTM, TimeDistributed, Dropout,Input
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import plot_model
# display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# warnings
import warnings
warnings.filterwarnings("ignore")
training_dataset = pd.read_csv('train.csv', encoding='utf-8')
# check the shape of the data.
td_row,td_col = training_dataset.shape
print("The total number of rows and columns in the training dataset are: {} and {}".format(td_row,td_col))
# check for missing values
print("Are there any missing values in the training dataset: {}".format(training_dataset.isnull().values.any()))
# check the top 10 rows of the dataset
training_dataset.head(10)
test_dataset = pd.read_csv('test.csv', encoding='utf-8')
# check the shape of the data.
test_row,test_col = test_dataset.shape
print("The total number of rows and columns in the test dataset are: {} and {}".format(test_row,test_col))
# check for missing values
print("Are there any missing values in the test dataset: {}".format(test_dataset.isnull().values.any()))
test_dataset.head(10)
test_labels = pd.read_csv('test_labels.csv', encoding='utf-8')
test_labels.head(10)
final_test_dataset = test_dataset.merge(test_labels,on='id',how='inner')
# filter out the -1 rows
final_test_dataset = final_test_dataset[final_test_dataset['toxic'] != -1]
final_test_dataset.head()
final_test_dataset.shape
# since a comment can belong to multiple categories, as evident from above, get the row-wise label count.
sum_per_row = training_dataset.iloc[:,2:].sum(axis=1)
comments_without_labels= len(sum_per_row[sum_per_row==0]) # clean comments no label associated with them.
comments_with_labels = len(training_dataset) - comments_without_labels
comment_count = [comments_with_labels,comments_without_labels]
comment_distribution = go.Figure([
    go.Bar(
        x=['Class 1 (Toxic)', 'Class 0 (Non-Toxic)'],
        y=comment_count,
        text=comment_count,
        textposition='outside',
        marker_color=['crimson', 'forestgreen']
    )
])
comment_distribution.update_layout(
    title='Toxic vs Non-Toxic Comment Count',
    xaxis_title='Label Type',
    yaxis_title='Count',
    width=500,
    height=500,
    font=dict(size=10)
)
comment_distribution.show()
# extract the names of the columns containing labels
comment_column_names = list(training_dataset.columns[2:])
count_dictionary = OrderedDict() # store the count for individual classes
class_1_labels = 0
for col_name in comment_column_names:
    label_count = int((training_dataset[col_name] == 1).sum())
    class_1_labels += label_count
    count_dictionary[col_name.title()] = label_count
clrs = px.colors.sequential.Plasma
layout = go.Layout( margin=go.layout.Margin( l=300 ) )
x_val = list(count_dictionary.keys())
y_val = list(count_dictionary.values())
cat_label_count = go.Figure([go.Bar(x=x_val,
                                    y=y_val,
                                    text=y_val,
                                    textposition='outside',
                                    marker_color=clrs)
                             ])
cat_label_count.update_layout(title='Toxic comments per category', xaxis_title='Categories', yaxis_title='Count', width=650, height=500, font=dict(size=10), xaxis={'categoryorder':'array'})
cat_label_count.show()
# check if the comment has been labeled twice.
print('Does the dataset contain comments having multiple labels?: {}'.format(int(comments_with_labels) != int(class_1_labels)))
# six categories
toxic_comments_text = training_dataset[training_dataset.toxic == 1]['comment_text'].values
severe_toxic_comments_text = training_dataset[training_dataset.severe_toxic == 1]['comment_text'].values
obscene_comments_text = training_dataset[training_dataset.obscene == 1]['comment_text'].values
threat_comments_text = training_dataset[training_dataset.threat == 1]['comment_text'].values
insult_comments_text = training_dataset[training_dataset.insult == 1]['comment_text'].values
identity_hate_comments_text = training_dataset[training_dataset.identity_hate == 1]['comment_text'].values
wordcloud_stopword=set(STOPWORDS)
plt.figure(figsize=(20,10))
# toxic
plt.subplot(321)
toxic_comments_wordcloud = WordCloud(min_font_size = 2, max_font_size = 50, font_step=1, max_words=1000, background_color = "white",stopwords=wordcloud_stopword).generate(" ".join(toxic_comments_text))
plt.axis("off")
plt.title("Common Words In Comments As Labeled Toxic", fontsize=14)
plt.imshow(toxic_comments_wordcloud)
# severe toxic
plt.subplot(322)
severe_toxic_comments_wordcloud = WordCloud(min_font_size = 2, max_font_size = 50, font_step=1, max_words=1000, background_color = "white",stopwords=wordcloud_stopword).generate(" ".join(severe_toxic_comments_text))
plt.axis("off")
plt.title("Common Words In Comments Labeled As Severe Toxic ", fontsize=14)
plt.imshow(severe_toxic_comments_wordcloud)
# obscene
plt.subplot(323)
obscene_comments_wordcloud = WordCloud(min_font_size = 2, max_font_size = 50, font_step=1, max_words=1000, background_color = "white",stopwords=wordcloud_stopword).generate(" ".join(obscene_comments_text))
plt.axis("off")
plt.title("Common Words In Comments Labeled As Obscene", fontsize=14)
plt.imshow(obscene_comments_wordcloud)
# threat
plt.subplot(324)
threat_comments_wordcloud = WordCloud(min_font_size = 2, max_font_size = 50, font_step=1, max_words=1000, background_color = "white",stopwords=wordcloud_stopword).generate(" ".join(threat_comments_text))
plt.axis("off")
plt.title("Common Words In Comments Labeled As Threat", fontsize=14)
plt.imshow(threat_comments_wordcloud)
# insult
plt.subplot(325)
insult_comments_wordcloud = WordCloud(min_font_size = 2, max_font_size = 50, font_step=1, max_words=1000, background_color = "white",stopwords=wordcloud_stopword).generate(" ".join(insult_comments_text))
plt.axis("off")
plt.title("Common Words In Comments Labeled As Insult", fontsize=14)
plt.imshow(insult_comments_wordcloud)
# identity hate
plt.subplot(326)
identity_hate_comments_wordcloud = WordCloud(min_font_size = 2, max_font_size = 50, font_step=1, max_words=1000, background_color = "white",stopwords=wordcloud_stopword).generate(" ".join(identity_hate_comments_text))
plt.axis("off")
plt.title("Common Words In Comments Labeled As Identity Hate", fontsize=14)
plt.imshow(identity_hate_comments_wordcloud)
plt.show()
# word clouds above indicate that the subcategories may be correlated. Since most of the comments have label toxic, checking the correlation between toxic and other labels.
correlation_matrices = list()
correlation_dataset = training_dataset[sum_per_row != 0]
for col_name in training_dataset.columns[2:]:
    # use a local name distinct from sklearn's imported confusion_matrix function
    label_crosstab = pd.crosstab(correlation_dataset['toxic'], correlation_dataset[col_name])
    correlation_matrices.append(label_crosstab)
# generate df
correlation_df = pd.concat(correlation_matrices,axis=1,keys=training_dataset.columns[2:])
correlation_df
# add a column that contains all the labels assigned to a comment as a list.
training_dataset['all_labels'] = training_dataset[comment_column_names].apply(
    lambda row: [label for label in comment_column_names if row[label] == 1], axis=1)
training_dataset[(training_dataset['toxic'] == 1) | (training_dataset['threat'] == 1)].iloc[:,1:].head(5)
# Sample text
training_dataset['comment_text'].head(5)
# regex pattern for removing stop words
remove_stop_words = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*', re.IGNORECASE)
def clean_comment_text(comment):
    # remove stop words
    processed_comment = remove_stop_words.sub('', comment)
    # strip newlines if any
    processed_comment = processed_comment.rstrip('\r\n')
    # remove punctuation and numbers
    processed_comment = re.sub('[^a-zA-Z]', ' ', processed_comment)
    # remove single characters
    processed_comment = re.sub(r"\s+[a-zA-Z]\s+", ' ', processed_comment)
    # remove multiple spaces
    processed_comment = re.sub(r'\s+', ' ', processed_comment)
    return processed_comment
# Clean training dataset
training_dataset['cleaned_comment_text'] = training_dataset['comment_text'].apply(clean_comment_text)
# original vs cleaned comment
training_dataset[['comment_text','cleaned_comment_text']].head(10)
# Clean test dataset
final_test_dataset['cleaned_comment_text'] = final_test_dataset['comment_text'].apply(clean_comment_text)
# original vs cleaned comment
final_test_dataset[['comment_text','cleaned_comment_text']].head(10)
# Clean comments
clean_comments_text = training_dataset[sum_per_row == 0]['cleaned_comment_text'].values
# Toxic comments
toxic_comments_text = training_dataset[sum_per_row != 0]['cleaned_comment_text'].values
# get the number of words per comment
clean_comments_counts = [len(item.split()) for item in clean_comments_text]
toxic_comments_counts = [len(item.split()) for item in toxic_comments_text]
print("Clean Comment Text lengths: Maximum is {} words and Minimum is {} words".format(max(clean_comments_counts), min(clean_comments_counts)))
print("Toxic Comment Text lengths: Maximum is {} words and Minimum is {} words".format(max(toxic_comments_counts), min(toxic_comments_counts)))
clean_comments_count_figure = go.Figure(data=[go.Histogram(x=clean_comments_counts, xbins=dict(start=0,end=5000,size=300), marker_color='forestgreen')])
clean_comments_count_figure.update_layout(title='Number Of Words Per Clean Comment', xaxis_title='Word Count', yaxis_title='Comment Count', width=650, height=500, font=dict(size=10))
clean_comments_count_figure.show()
toxic_comments_count_figure = go.Figure(data=[go.Histogram(x=toxic_comments_counts, xbins=dict(start=0,end=5000,size=300), marker_color='crimson')])
toxic_comments_count_figure.update_layout(title='Number Of Words Per Toxic Comment', xaxis_title='Word Count', yaxis_title='Comment Count', width=650, height=500, font=dict(size=10))
toxic_comments_count_figure.show()
words_per_comment = list(training_dataset["cleaned_comment_text"].apply(lambda x: len(str(x).split())))
unique_words_per_comment = list(training_dataset["cleaned_comment_text"].apply(lambda x: len(set(str(x).split()))))
x_words2 = list(range(1,training_dataset.shape[0]))
word_counts = go.Figure(data=[
    go.Scatter(x=x_words2, y=words_per_comment, line=dict(color='#0077bb'), name='Word Count'),
    go.Scatter(x=x_words2, y=unique_words_per_comment, line=dict(color='orange'), name='Unique Word Count')
])
word_counts.update_layout(title='Word Count vs Unique Word Count', xaxis_title='Comment Index', yaxis_title='Word Count', height=500, font=dict(size=10))
word_counts.show()
The primary difference between multi-class and multi-label classification is that in multi-class classification there are multiple categories, but each instance is assigned exactly one of them. In multi-label classification, each instance can be assigned multiple categories, and the categories may be correlated.
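The distinction shows up directly in the shape of the targets. A toy sketch (with made-up values, three labels instead of our six):

```python
import numpy as np

# multi-class: each instance has exactly one class id (a 1-D target vector)
y_multiclass = np.array([0, 2, 1, 2])

# multi-label: each instance has a 0/1 indicator per label (a 2-D target matrix),
# mirroring the six label columns of this dataset
y_multilabel = np.array([[1, 0, 1],
                         [0, 0, 0],
                         [1, 1, 1],
                         [0, 1, 0]])

print(y_multiclass.shape)        # (4,)
print(y_multilabel.shape)        # (4, 3)
print(y_multilabel.sum(axis=1))  # rows may carry 0, 1, or several labels
```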
We experimented with commonly used multi-label classification approaches to build our models. They are:
In this method, the multi-label problem is transformed into one or more single-label classification problems. This can be carried out in three different ways:
The problem is decomposed into multiple binary classification problems. We pick one class and train a binary classifier with the selected class's samples on one side and all the other samples on the other side.
This is the simplest technique: it treats each label as a separate binary classification problem, so the response is broken into six independent problems, one per class. It is the simplest and most efficient method, but its drawback is that it does not consider label correlations because it treats every target variable independently.
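As a minimal sketch of binary relevance (using random stand-in data rather than the TF-IDF features built later, and three hypothetical labels instead of six), one independent binary classifier is fit per label column:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # stand-in for a feature matrix
Y = (rng.random((200, 3)) < 0.3).astype(int)   # three hypothetical binary labels

# binary relevance: one independent binary classifier per label column
models = [LogisticRegression(max_iter=1000).fit(X, Y[:, j]) for j in range(Y.shape[1])]

# stack the per-label predictions back into a label matrix
Y_pred = np.column_stack([m.predict(X) for m in models])
print(Y_pred.shape)  # (200, 3)
```

Note that no classifier ever sees another label, which is exactly why label correlations are lost.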

Here, the first classifier is trained on the input data alone, and each subsequent classifier is trained on the input space plus the outputs of all previous classifiers in the chain. The response is again transformed into six single-label classifiers. Classifier chains are quite similar to binary relevance; the difference is that the chaining preserves label correlations.
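A minimal sketch of the chaining idea (again on random stand-in data with three hypothetical labels), mirroring the feature-appending approach used later in this notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))                 # stand-in feature matrix
Y = (rng.random((200, 3)) < 0.3).astype(int)   # three hypothetical binary labels

X_aug = X.copy()
preds = []
for j in range(Y.shape[1]):
    # each classifier sees the original features plus all earlier labels
    clf = LogisticRegression(max_iter=1000).fit(X_aug, Y[:, j])
    preds.append(clf.predict(X_aug))
    # chain the current label in as an extra feature for the next classifier
    X_aug = np.column_stack([X_aug, Y[:, j]])

Y_pred = np.column_stack(preds)
print(X_aug.shape)  # (200, 13): one extra feature column per chained label
```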

In this method, the problem is transformed into a multi-class problem: one multi-class classifier is trained on all unique label combinations found in the training data. Label powerset thus gives a unique class to every label combination present in the training set. The disadvantage of this method is that the number of label combinations grows exponentially with the number of labels, increasing model complexity, which can result in lower accuracy. In our case, we have nearly 150000 records in the training data, and increasing this size would only impact the model negatively.
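A minimal sketch of label powerset (random stand-in data, three hypothetical labels): each distinct row of the label matrix is mapped to a single multi-class id, one multi-class model is fit, and predictions are decoded back into label vectors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))                 # stand-in feature matrix
Y = (rng.random((300, 3)) < 0.4).astype(int)   # three hypothetical binary labels

# every distinct row of the label matrix becomes one multi-class target
combos = [tuple(row) for row in Y]
classes = sorted(set(combos))
combo_to_id = {c: i for i, c in enumerate(classes)}
y_powerset = np.array([combo_to_id[c] for c in combos])

clf = LogisticRegression(max_iter=1000).fit(X, y_powerset)

# decode predicted class ids back into label vectors
Y_pred = np.array([classes[i] for i in clf.predict(X)])
print(len(classes))  # at most 2**3 = 8 combinations for three labels
```

With six labels the ceiling is already 2**6 = 64 combinations, which illustrates the exponential growth mentioned above.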

Neural network models can be configured to support multi-label classification and perform well, depending on the classification task's specifics. Multi-label classification can be supported directly by neural networks by merely specifying the number of target labels in the problem as the number of nodes in the output layer.
For example, a task with three output labels (classes) will require a neural network with three nodes in the output layer, each using the sigmoid activation.
The NN will predict a probability of class membership for the label, a value between 0 and 1. Finally, the model must be fit with the binary cross-entropy loss function.
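The full Keras model appears later in this notebook; here is a plain numpy sketch (with hypothetical logits) of what the sigmoid output layer and the binary cross-entropy loss compute for one instance:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical raw scores from a 3-node sigmoid output layer for one instance
logits = np.array([2.0, -1.0, 0.5])
probs = sigmoid(logits)  # independent per-label probabilities in (0, 1)

# binary cross-entropy against the true label vector, averaged over the nodes
y_true = np.array([1, 0, 1])
bce = -np.mean(y_true * np.log(probs) + (1 - y_true) * np.log(1 - probs))

# thresholding at 0.5 yields the predicted label set
labels_pred = (probs > 0.5).astype(int)
print(labels_pred)  # [1 0 1]
```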
Use an ensemble of models to perform classification
X = training_dataset['cleaned_comment_text']
y = training_dataset[comment_column_names].values
# Split features and response into training and validation data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=862)
# Define X_test
X_test = final_test_dataset['cleaned_comment_text']
# Define y_test
y_test = final_test_dataset[comment_column_names].values
# Define the TFIDF vectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True,
                             strip_accents='unicode',
                             analyzer='word',
                             token_pattern=r'\w{1,}',
                             stop_words='english',
                             ngram_range=(1, 3),
                             )
# Apply on X_train (fit and transform) and X_valid (transform) features
X_train_vectorized = vectorizer.fit_transform(X_train)
X_valid_vectorized = vectorizer.transform(X_valid)
# Similarly, vectorize cleaned comments of the test data
X_test_vectorized = vectorizer.transform(final_test_dataset['cleaned_comment_text'])
# Define y_test from the cleaned comments of the test data
y_test = final_test_dataset[comment_column_names].values
# Instantiate OneVsRestClassifier using Logistic Regression
ovs_classifier = OneVsRestClassifier(LogisticRegression(class_weight='balanced', C=12, random_state=862),n_jobs=-1)
# Fit on train data
ovs_classifier.fit(X_train_vectorized, y_train)
# Predict on validation data
ovs_predictions = ovs_classifier.predict(X_valid_vectorized)
# Print classification report
print(classification_report(y_valid, ovs_predictions))
# Predict on test data
test_ovs_predictions = ovs_classifier.predict(X_test_vectorized)
# Print classification report
print(classification_report(y_test, test_ovs_predictions))
print('Hamming Loss for OneVsRest Classifier using Logistic Regression is: {:.2f}'.format(hamming_loss(y_test, test_ovs_predictions)))
print('Accuracy for OneVsRest Classifier using Logistic Regression is: {:.2f}'.format(accuracy_score(y_test, test_ovs_predictions)))
# Instantiate OneVsRestClassifier using MultinomialNB
ovs_classifier_nb = OneVsRestClassifier(MultinomialNB(),n_jobs=-1)
# Fit on train data
ovs_classifier_nb.fit(X_train_vectorized, y_train)
# Predict on validation data
ovs_predictions_nb = ovs_classifier_nb.predict(X_valid_vectorized)
# Print classification report
print(classification_report(y_valid, ovs_predictions_nb))
# Predict on test data
test_ovs_predictions_nb = ovs_classifier_nb.predict(X_test_vectorized)
# Print classification report
print(classification_report(y_test, test_ovs_predictions_nb))
print('Hamming Loss for OneVsRest Classifier using Multinomial Naïve Bayes is: {:.2f}'.format(hamming_loss(y_test, test_ovs_predictions_nb)))
print('Accuracy for OneVsRest Classifier using Multinomial Naïve Bayes is: {:.2f}'.format(accuracy_score(y_test, test_ovs_predictions_nb)))
This is the implementation using the scikit-multilearn library's predefined module for performing Binary Relevance. However, the code below killed the kernel multiple times and is not working, hence we adopted an alternate approach to evaluate the model performance.
#br_classifier = BinaryRelevance(LogisticRegression(class_weight='balanced', C=12, random_state=862),n_jobs=-1)
#br_classifier.fit(X_train_vectorized, y_train)
#br_predictions = br_classifier.predict(X_valid_vectorized)
#print(classification_report(y_valid, br_predictions))
train_text = training_dataset['cleaned_comment_text']
test_text = final_test_dataset['cleaned_comment_text']
tfidf = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 3),
    max_features=10000)
X_train_2 = tfidf.fit_transform(train_text)
X_test_2 = tfidf.transform(test_text)
toxicity_labels = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']
test_dataset_col_names = list(final_test_dataset.columns[:2])
# make a copy of test data
temp_test_dataset = final_test_dataset.copy()
temp_test_dataset = temp_test_dataset[test_dataset_col_names]
temp_test_dataset.head(2)
# make a copy of test data to hold predictions
result1 = temp_test_dataset.copy()
result1['toxic'] = ''
result1['severe_toxic'] = ''
result1['obscene'] = ''
result1['threat'] = ''
result1['insult'] = ''
result1['identity_hate'] = ''
result1.shape
result2 = result1.copy()
result3 = result1.copy()
result4 = result1.copy()
# Evaluate model performance
def evaluate_score(y, y_pred, label):
    print('Printing results for category: {}'.format(label))
    hamm_loss = hamming_loss(y, y_pred)
    print("Hamming Loss:", hamm_loss)
    f1 = f1_score(y, y_pred, average='macro')
    print("F1 Score:", f1)
clf1 = LogisticRegression(C=12.0) # Instantiate Logistic Regression
for label in toxicity_labels:
    y = training_dataset[label]
    # train model using X_train and y for each class
    model1 = clf1.fit(X_train_2, y)
    # compute evaluation scores for the training set
    y_pred = model1.predict(X_train_2)
    print('-------------------------------------------------------------')
    print('Classification Report for {}'.format(label))
    print(classification_report(y, y_pred))
    y_test_probability = model1.predict_proba(X_test_2)[:, 1]
    result1[label] = y_test_probability
result1_values = result1[comment_column_names].values
# transform into required format: convert the probabilities obtained to 0's and 1's using a threshold of 0.5
y_br_lr = (result1_values > 0.5).astype(int)
print(classification_report(y_test, y_br_lr))
print('Hamming Loss for Binary Relevance Classifier using Logistic Regression is: {:.2f}'.format(hamming_loss(y_test, y_br_lr)))
print('Accuracy for Binary Relevance Classifier using Logistic Regression is: {:.2f}'.format(accuracy_score(y_test, y_br_lr)))
clf2 = MultinomialNB() # Instantiate Naive Bayes
for label in toxicity_labels:
    y = training_dataset[label]
    # train model using X_train and y for each class
    model2 = clf2.fit(X_train_2, y)
    # compute evaluation scores for the training set
    y_pred = model2.predict(X_train_2)
    print('-------------------------------------------------------------')
    print('Classification Report for {}'.format(label))
    print(classification_report(y, y_pred))
    y_test_probability = model2.predict_proba(X_test_2)[:, 1]
    result2[label] = y_test_probability
result2_values = result2[comment_column_names].values
# transform into required format
y_br_nb = (result2_values > 0.5).astype(int)
# Classification report for Binary Relevance using Multinomial Naïve Bayes
print(classification_report(y_test, y_br_nb))
print('Hamming Loss for Binary Relevance Classifier using Multinomial Naïve Bayes is: {:.2f}'.format(hamming_loss(y_test, y_br_nb)))
print('Accuracy for Binary Relevance Classifier using Multinomial Naïve Bayes is: {:.2f}'.format(accuracy_score(y_test, y_br_nb)))
Similar to Binary Relevance, the Classifier Chain module killed the kernel multiple times and is not working. Hence we adopted an alternate approach to evaluate the model performance.
#cc_classifier = ClassifierChain(LogisticRegression(class_weight='balanced', C=12, random_state=862),n_jobs=-1)
#cc_classifier.fit(X_train_vectorized, y_train)
#cc_predictions = cc_classifier.predict(X_valid_vectorized)
#print(classification_report(y_valid, cc_predictions))
# create a function to add features
def add_feature(X, feature_to_add):
    '''
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    '''
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')
# source: - https://www.kaggle.com/rhodiumbeng/classifying-multi-label-comments-0-9741-lb?select=sample_submission.csv.zip
clf3 = LogisticRegression(C = 12.0) # Instantiate Logistic Regression
for label in toxicity_labels:
    y = training_dataset[label]
    # train model using X_train and y for each class
    model3 = clf3.fit(X_train_2, y)
    # compute evaluation score
    y_pred = model3.predict(X_train_2)
    print('-------------------------------------------------------------')
    print('Classification Report for {}'.format(label))
    print(classification_report(y, y_pred))
    # predict on test set
    y_test = model3.predict(X_test_2)
    y_test_probability = model3.predict_proba(X_test_2)[:, 1]
    result3[label] = y_test_probability
    # chain current label to X_train
    X_train_2 = add_feature(X_train_2, y)
    print('New shape of X_train:', X_train_2.shape)
    # chain current label predictions to X_test
    X_test_2 = add_feature(X_test_2, y_test)
    print('New shape of X_test:', X_test_2.shape)
result3_values = result3[comment_column_names].values
# transform into required format
y_cc_lr = list()
for prediction in result3_values:
temp_list = list()
for item in prediction:
if item > float(0.5):
temp_list.append(1)
else:
temp_list.append(0)
y_cc_lr.append(temp_list)
y_test = final_test_dataset[comment_column_names].values
print(classification_report(y_test, y_cc_lr))
print('Hamming Loss for Classifier Chain Classifier using Logistic Regression is: {:.2f}'.format(hamming_loss(y_test, y_cc_lr)))
print('Accuracy for Classifier Chain Classifier using Logistic Regression is: {:.2f}'.format(accuracy_score(y_test, y_cc_lr)))
clf4 = MultinomialNB() # Instantiate Naive Bayes
for label in toxicity_labels:
    y = training_dataset[label]
    # train model using X_train and y for each class
    model4 = clf4.fit(X_train_2, y)
    # compute evaluation score
    y_pred = model4.predict(X_train_2)
    print('-------------------------------------------------------------')
    print('Classification Report for {}'.format(label))
    print(classification_report(y, y_pred))
    # predict on test set
    y_test = model4.predict(X_test_2)
    y_test_probability = model4.predict_proba(X_test_2)[:, 1]
    result4[label] = y_test_probability
    # chain current label to X_train
    X_train_2 = add_feature(X_train_2, y)
    print('New shape of X_train:', X_train_2.shape)
    # chain current label predictions to X_test
    X_test_2 = add_feature(X_test_2, y_test)
    print('New shape of X_test:', X_test_2.shape)
result4_values = result4[comment_column_names].values
# transform into required format
y_cc_nb = list()
for prediction in result4_values:
temp_list = list()
for item in prediction:
if item > float(0.5):
temp_list.append(1)
else:
temp_list.append(0)
y_cc_nb.append(temp_list)
y_test = final_test_dataset[comment_column_names].values
print(classification_report(y_test, y_cc_nb))
print('Hamming Loss for Classifier Chain Classifier using Multinomial Naïve Bayes is: {:.2f}'.format(hamming_loss(y_test, y_cc_nb)))
print('Accuracy for Classifier Chain Classifier using Multinomial Naïve Bayes is: {:.2f}'.format(accuracy_score(y_test, y_cc_nb)))
Neural networks such as LSTMs (Long Short-Term Memory networks) are very good at understanding context.
A bidirectional LSTM runs the input in two directions, one from past to future and one from future to past. What distinguishes this approach from the unidirectional one is that the backward-running LSTM preserves information from the future; by combining the two hidden states, the network can, at any point in time, draw on information from both past and future.
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(training_dataset['cleaned_comment_text'].values)
from tensorflow.keras.preprocessing import text, sequence
X_train_seq = tokenizer.texts_to_sequences(training_dataset['cleaned_comment_text'].values)
X_train_padded = sequence.pad_sequences(X_train_seq, maxlen=100)
X_test_seq = tokenizer.texts_to_sequences(final_test_dataset['cleaned_comment_text'].values)
X_test_padded = sequence.pad_sequences(X_test_seq, maxlen=100)
def clean_text_for_embeddings(comment):
    # strip newlines if any
    comment = comment.rstrip('\r\n')
    # remove non-alphanumeric characters
    comment = re.sub(r'[^\w\s]', ' ', comment)
    # convert to lowercase
    comment = comment.lower()
    return word_tokenize(comment)
The cells below for creating the embeddings have been commented out to prevent generating new embeddings again.
# cleaned_training_comments = training_dataset['cleaned_comment_text'].tolist()
# cleaned_test_comments = final_test_dataset['cleaned_comment_text'].tolist()
# cleaned_comment_text_for_embeddings = cleaned_training_comments + cleaned_test_comments
# create embedding tokens
#embedding_tokens = list()
#for comment in cleaned_comment_text_for_embeddings:
#    if len(comment.split()) != 0:
#        embedding_tokens.append(clean_text_for_embeddings(comment))
# store the embeddings
#with open('generated_embeddings.pickle', 'wb') as f:
#    pickle.dump(w2v_dict, f, pickle.HIGHEST_PROTOCOL)
import pickle
# Load the embeddings dictionary
with open('generated_embeddings.pickle', 'rb') as w2v_file:
    embedding_dictionary = pickle.load(w2v_file)
print("length of word embeddings: ", len(embedding_dictionary.keys()))
word_index = tokenizer.word_index
print('Total unique words %s' % len(word_index))
words_max = len(word_index)+1
embedding_matrix = np.zeros((len(word_index)+1, 100))
for word, i in word_index.items():
    if i >= words_max:
        continue
    embedding_vector = embedding_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
callback = tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', patience = 5)
def build_model():
    embedding_dim = 100
    model = Sequential()
    embedding_layer = Embedding(words_max, embedding_dim, weights=[embedding_matrix], input_length=100, trainable=False)
    model.add(embedding_layer)
    model.add(Bidirectional(LSTM(64, return_sequences=True)))
    model.add(Dropout(0.2))
    model.add(Bidirectional(LSTM(64, return_sequences=True)))
    model.add(Dropout(0.2))
    model.add(Bidirectional(LSTM(64, return_sequences=True)))
    model.add(Dropout(0.2))
    model.add(Bidirectional(LSTM(64, return_sequences=False)))
    model.add(Dropout(0.2))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(6, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
bi_lstm_model = build_model()
bi_lstm_model.summary()
plot_model(bi_lstm_model)
y = training_dataset[comment_column_names].values
bi_lstm_model.fit(X_train_padded, y, batch_size=32, epochs=5, validation_split=0.2, callbacks=[callback])
y_pred_biLSTM = bi_lstm_model.predict([X_test_padded],verbose=1)
# convert the predicted probabilities to 0/1 labels using a threshold of 0.5
bi_lstm_result = (y_pred_biLSTM > 0.5).astype(int)
print(classification_report(y_test, bi_lstm_result))
print('Hamming Loss for Bi-LSTM Classifier is: {:.2f}'.format(hamming_loss(y_test, bi_lstm_result)))
print('Accuracy for Bi-LSTM Classifier is: {:.2f}'.format(accuracy_score(y_test, bi_lstm_result)))
The table below summarizes the high-level performance of the models on the test dataset. A naïve glance at the table suggests that our Binary Relevance Logistic Regression model is the best, as it has the highest F1-score and comparable accuracy. However, upon analyzing the different models' classification reports, we can see that our Logistic Regression and Multinomial Naïve Bayes based models have abysmal classification performance.
From the individual classification reports, we can see these classifiers act as majority-class classifiers: they do a good job of predicting Class 0 (non-toxic) comments but either overpredict or have low precision when classifying toxic labels.
The deep learning model also has a very low F1-score; however, the accuracy is high. Upon analyzing the classification report of the Bi-LSTM model, we can see that it completely misses two toxic subcategories resulting in a low F1-score. We can attribute this low score to the prevalent class imbalance in our dataset. The class labels were not stratified while doing the training/validation split. However, the performance of the model on the other toxic subcategories is high.
Hence, we choose the Bi-LSTM model as our best model; its classification performance can be further enhanced by using data augmentation techniques and stratified splitting.
| Model Name | Hamming Loss | F1-Score (Macro Average) | Accuracy |
|---|---|---|---|
| OneVsRest Logistic Regression | 0.05 | 0.49 | 0.84 |
| OneVsRest Multinomial Naïve Bayes | 0.04 | 0.03 | 0.90 |
| Binary Relevance Logistic Regression | 0.03 | 0.53 | 0.88 |
| Binary Relevance Multinomial Naïve Bayes | 0.03 | 0.38 | 0.90 |
| Classifier Chain Logistic Regression | 0.03 | 0.52 | 0.88 |
| Classifier Chain Multinomial Naïve Bayes | 0.06 | 0.39 | 0.85 |
| Bi-LSTM Neural Network | 0.03 | 0.38 | 0.89 |
To summarize, through this project we learned how multi-label classification works and the methods available for it. The machine learning concepts and algorithms learned during our class, such as classification, Naïve Bayes, neural networks, and NLP techniques, helped us execute the project, and we were able to apply them accordingly to build the models and evaluate their performance. The best model (Bi-LSTM) is a good baseline that could be employed to classify specific toxic subcategories. However, it requires additional training for the categories that have fewer data points.